Introduction

Ruta chalepensis L., known as “fringed rue”, is one of the most spread medicinal plants of the Ruta genus. It is an evergreen subshrub native to Mediterranean region; however, it may be found in several other parts of the world (Yoshida et al. 2019). R. chalepensis is among the richest sources of different important natural products e.g. flavonoids, alkaloids, saponins, coumarins, cardiac glycosides, terpenoids, tannins, and anthraquinones (Emam et al. 2010; Al-Majmaie et al. 2019). The plant is largely used in the traditional medicine in several different countries including Saudi Arabia (Günaydin and Savci 2005). Different parts of this plant are commonly used for the treatment of menstrual and labor pains, eye problems, some nervous disease, and intestinal and stomach problems (Loizzo et al. 2018). In some countries, the plant is used to enhance the flavor of some food and drinks. Because of its high content of flavonoids, especially rutin (Loizzo et al. 2018), R. chalepensis attracted many studies for its potential medical importance.

Rutin is a major flavonoid found in R. chalepensis with a wide number of pharmacological activities (Loizzo et al. 2018). In central nervous system, rutin showed protective effect against brain ischemia and is useful in the treatment of Alzheimer’s disease by inhibiting cytotoxicity of β-amyloid oligomeric (Ganeshpurkar and Saluja 2017; Pan et al. 2019). Rutin has significant antibacterial, antimalarial, as well as anti-diabetic potential, whereas application of rutin on human leukemia cell line (HL-60) showed a significant reduction in tumor size indicates its potential antileukemic effect (Ganeshpurkar and Saluja 2017). Interestingly, rutin was reported to have ameliorative effects against toxicity induced by cancer chemotherapy including cisplatin (Almutairi et al. 2017; Jahan et al. 2018), idarubicin (Deihimi et al. 2017), and methotrexate (Gautam et al. 2016). Moreover, rutin showed hepatoprotective (Khan et al. 2012a; Hafez et al. 2015) and nephroprotective (Khan et al. 2012b; Azevedo et al. 2013; Duarte et al. 2014) activities. The major plant source of rutin production is Fagopyrum tataricum Gaertn. Using R. chalepensis plants for different medicinal purposes needs more understanding of the genes responsible for the biosynthesis of different chemical compounds especially flavonoids e.g. rutin. Until now, rutin biosynthesis was studied in some plants e.g. Asparagus officinalis and F. tataricum (Huang et al. 2016; Yi et al. 2019). However, there are no studies available regarding the biosynthesis pathway of rutin in Ruta genus, especially R. chalepensis.

Next-generation sequencing (NGS) has greatly increased the opportunity to study the plant transcriptome to assess the genetic changes behind the biosynthesis of active compounds inside the plant. NGS has improved the quality of nucleic acids sequencing and decreased its cost very significantly. Transcriptomic studies using NGS produce a huge amount of transcript sequences data which can be used for different research purposes. RNA-sequencing (RNA-Seq) is one of the most important applications of NGS. This technique is typically used to quantify the changes in the expression levels of different transcripts under different conditions, at different developmental stages, or in different organs/tissues of the plant (Wang et al. 2009). RNA-Seq technique was utilized to study the whole transcriptome of different plants and to identify the expression levels of genes involved in biosynthesis of their active components in different plat organs.

One of the major obstacles hindering the application of RNA-Seq workflow on different organisms is the absence of reference genome, especially in organisms having complex genomes with high levels of repetitive DNA such as plants (Kerr et al. 2019). However, development of de novo transcriptome assembly algorithms and tools enabled researchers to perform different kind of analyses using RNA-Seq data in non-model organisms with no reference genome available (Moreton et al. 2016; Ungaro et al. 2017). Moreover, de novo assembly of transcriptome of organisms with or without reference genome provides deep insights on the transcriptional changes and, thus, biological variations in different organs or developmental stages of the non-model species. De novo transcriptomic studies aimed to examine the genes involved in biosynthesis of active compounds were reported on different plants species e.g. Agave hybrid H11648 (Huang et al. 2019), Silene uralensis and S. schafta (Bertrand et al. 2018), Lithospermum officinale (Rai et al. 2018), and S. dioica (Cegan et al. 2017). This study was performed to generate de novo assembly of R. chalepensis whole transcriptome based on data obtained from different organs and functionally annotate the transcriptome to identify genes involved in rutin biosynthesis. Our results would lay the foundation for further genomics and transcriptomic studies on R. chalepensis and help in dentification and genetic modification of genes involved in the biosynthesis of secondary metabolites in medicinal plants.

Materials and Methods

Plant material

R. chalepensis seedlings were obtained from a nursery in Riyadh, Saudi Arabia. Seedlings were grown in a growth chamber affiliated with our lab in the Department of Botany and Microbiology, College of Sciences, King Saud University. They were grown for 3 months at 25 ± 2°C with 50 µmol m^-2 s^-1 light provided by cool white fluorescent lamps for 16/8 h day/night. Seedlings were maintained in 30-cm diameter pots containing German potting soil mix and irrigated day after day with 200 mL of distilled water. Plants were fertilized monthly with NPK (20-5-5+TE) fertilizer.

After 3 months of growth, R. chalepensis plants were carefully removed from the soil to avoid damaging roots as possible. Leaves, stems, and roots were separated from each other and rinsed with ultrapure Milli-Q water (Elix^® Advantage, Merck KGaA, Darmstadt, Germany) three times at least. Roots were washed in a sink with tap water to remove all the remaining soil particles before rinsing with Milli-Q water. After complete air drying, 1 g of fresh tissues were collected from leaves, stems, and roots and instantly frozen in liquid nitrogen (-196°C) for further RNA isolation.

RNA isolation

RNA was isolated from three biological replicates of leaves, stems, and roots tissues collected from three different R. chalepensis plants using RNeasy® Plant Mini Kit (Cat No. 74904, QIAGEN, GmbH, D-40724 Hilden, Germany). Preliminary quality check (QC) of isolated RNA was performed using Agarose Gel Electrophoresis. RNA quantity and purity were checked using Thermo Scientific NanoDrop 2000c UV-Vis Spectrophotometer (Thermo Fisher Scientific Inc., MA, USA). RNA integrity of the samples was checked using Agilent 2100 Bioanalyzer (Agilent Technologies, CA, USA).

Library construction and RNA sequencing

After assuring the quality of RNA samples, RNA libraries were constructed using NEBNext® Ultra™ RNA Library Prep Kit for Illumina® (New England Biolabs, MA, USA). The quality of the constructed libraries was then checked using Qubit® 2.0 Fluorometer (Thermo Fisher Scientific Inc., MA, USA), Agilent 2100 Bioanalyzer, and quantitative-PCR (Q-PCR). The qualified libraries were fed into NovaSeq 6000 sequencers (Illumina, CA, USA) after pooling according to its effective concentration and expected data volume. Sequencing was performed to produce 6 Gb of data per sample using Illumina PE sequencing methods with a length of 150 bp with an insertion size of 250 ~ 300 bp.

Data preprocessing

After obtaining FASTQ sequence files, their quality was analyzed using FASTQC v0.11.9 software (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/). Raw data were filtered to remove reads containing adapters, reads containing more than 10% of N bases, and low-quality reads. Cutadapt tool v2.8 (Martin 2011) was used to filter the reads. RNA-seq error Corrector (Rcorrector) tool (Song and Florea 2015) was used to perform k-mer correction of filtered PE FASTQ files using default k-mer length. Filtered and corrected FASTQ files (clean reads) were then used for de novo transcriptome assembly and functional annotation.

De novo transcriptome assembly

Since there is no available reference genome for R. chalepensis plant, de novo transcriptome assembly was performed to obtain a reference that could be used for further functional annotation of potential genes found in this plant. For preparation of reads for assembly using Linux command line tools, all the left clean reads from different tissues and replicates were combined in one file and, similarly, all right reads were combined in one file for assembly. Trinity assembler v2.8.4 (Grabherr et al. 2011) was used for de novo assembly of RNA-Seq data obtained in the current study in PE mode with in silico normalization and minimum assembled contig list of 200. Quality of the assembled transcriptome was checked via counting number of assembled transcripts that seemed to be full-length or nearly full-length, examining read representation in the assembly, OrthoDB’s sets of Benchmarking Universal Single-Copy Orthologs “BUSCO” tool suite (http://busco.ezlab.org/) (Seppey et al. 2019), and investigating the strand specificity of the generated data.

Functional annotation

Annocript pipeline v2.0.1 (Musacchia et al. 2015) was used to annotate the de novo assembled R. chalepensis whole transcriptome. All the assembled transcripts were searched against several databases using different BLAST tools. BLASTX was used to search the assembled transcripts against SwissProt and UniRef90 databases. RPSTBLASTN was used to reveal the Conserved Domains within each transcript via searching transcripts against Conserved Domains Database (CDD; https://www.ncbi.nlm.nih.gov/cdd/). For translation of nucleotide sequences into their respective amino acids, Virtual Ribosome (dna2pep) v1.1 (Wernersson 2006) was used.

Gene ontology (GO), enzyme, and pathway annotation: GO functional annotation were performed based the results of both proteins annotated against UniRef90 database and domains annotated against CDD database. Similarly, Enzyme Commission (EC) IDs were assigned for each annotated sequence based on SwissProt results. Moreover, annotated sequencing against SwissProt database were assigned their respective pathway information based on pathway database v2019_11 provided by UniProt – SwissProt Protein Knowledgebase (Morgat et al. 2011).

lncRNAs identification: BLASTN was used to search the assembled sequences against RNAcentral database (The Rnacentral Consortium 2018) that contained information from more than 40 databases to identify potential lncRNAs, ribosomal RNAs (rRNAs). Next, coding potential score for each putative lncRNA was calculated using Coding Potential Calculator v2.0 (CPC2) software (Kang et al. 2017). Finally, RNA molecule was identified as lncRNA if it had coding potential score less than 0.5, was longer than 200 bp, and had open reading frame (ORF) shorter than 100 bp.

Microsatellite identification: The microsatellite identification tool MISA (Thiel et al. 2003; Beier et al. 2017) was used to predict potential SSRs in the de novo assembled transcriptome of R. chalepensis. MISA was used to predict mono-nucleotide, di-nucleotide, tri-nucleotide, tetra-nucleotide, penta-nucleotide, and hexa-nucleotide SSRs. Search criteria for identification of potential SSRs were set to 10 repeats for mono-nucleotides, 6 repeats for di-nucleotides, and 5 repeats for each tri-nucleotide, tetra-nucleotide, penta-nucleotide, and hexa-nucleotides.

Transcription factors (TFs) identification: To identify potential TFs in the de novo assembled transcriptome of R. chalepensis, all the assembled genes were searched against Plant Transcription Factor Database (PlantTFDB) v5.0 (Tian et al. 2019). BLASTX tool was used to search assembled nucleotide sequences against TFs protein sequences with an e-value of 1e-3.

Results

RNA quality and quantity

The obtained results showed that all the isolated samples passed the criteria for sequencing as all the samples were pure without any DNA contamination as revealed by electrophoresis. Furthermore, NanoDrop report showed high concentrations ranging from 153 to 718 ng µL^-1. A260/A280 for all samples was above 1.8 except the first replicate of leaves sample (L1), which showed a ratio of 1.75. Similarly, A260/230 for all samples ranged between 1.5 to 1.8 indicating no contamination with chaotropic salts or organic compounds. All the samples showed RIN above 6 indicating high integrity (Table 1).

Data cleaning

Table 1: RNA QC results obtained using NanoDrop and Bioanalyzer

Sample ^a	Sample ID	Conc (ng µL^-1)	Amount (µg)	260/280	260/230	RIN
L1	TZTR181213022	153	3.470	1.75	1.75	6.6
L2	TZTR181213023	436	6.540	2.04	1.76	6.3
L3	TZTR181213024	718	10.770	2.10	1.61	6.5
S1	TZTR181213025	514	7.710	2.12	1.81	8.5
S2	TZTR181213026	628	9.420	2.08	1.78	9.6
S3	TZTR181213029	452	6.780	2.11	1.81	9.0
R1	TZTR181213028	187	2.905	2.00	1.57	6.2
R2	TZTR181213027	321	4.815	2.02	1.80	9.0
R3	TZTR181213030	180	2.500	1.85	1.51	7.3

^aL1, L2, L3: leaves; S1, S2, S3: stems; R1, R2, R3: roots

Table 2: Detailed statistics for the quality of sequencing data

Sample ^a	Raw reads	Cleaned reads	Raw bases (Gb)	Clean bases (Gb)	Effective rate (%)^b	GC (%)
L1	42,008,314	38,359,520	6.3	5.8	91.31	46
L2	47,128,552	42,897,490	7.1	6.4	91.02	46
L3	44,570,714	40,629,030	6.7	6.1	91.16	46
S1	49,951,230	45,807,238	7.5	6.9	91.70	45
S2	59,231,140	53,585,430	8.9	8.0	90.47	45
S3	54,594,866	49,705,054	8.2	7.5	91.04	45
R1	43,498,252	39,333,020	6.5	5.9	90.42	44
R2	42,389,424	37,992,108	6.4	5.7	89.63	45
R3	42,943,700	38,664,740	6.4	5.8	90.04	45
Total	426,316,192	386,973,630	63.9	58.0	-	-

^aL1, L2, L3: leaves; S1, S2, S3: stems; R1, R2, R3: roots

^bEffective rate = clean reads / raw reads × 100

Fig. 1: Length distribution of R. chalepensis (a) de novo assembled transcripts and (b) longest ORF of each transcript

The total amount of raw data generated from three different replicates for each of the studied tissues equaled 63.9 Gb; however, after cleaning (filtration and correction) total kept data was 58.0 Gb. The total generated raw reads were 213 million PE reads (426 million reads). The total kept reads after cleaning reached roughly 193.5 million PE reads (387 million reads). The average effective rate calculated as number of cleaned reads over number of raw reads was 90.76% (Table 2). No major changes were observed in GC contents in different sequence replicates for the studied tissues after cleaning process. In all cases, no change was observed in GC content after filtration and correction except in the third replicate of roots (R3) where it was increased from 44 to 45% in raw data and cleaned data, respectively.

Table 3: Summary of de novo assembly of R. chalepensis whole transcriptome using Trinity

Statistic	Total	Based on all genes	Based on longest isoform
Assembled genes	145,018
Assembled transcripts	254,685
GC (%)	42.60
Median contig length		584	347
Average contig length		1,183.54	691.74
Total assembled bases		301,430,269	106,540,322
Contig N10		4,938	4,461
Contig N20		3,880	3,297
Contig N30		3,215	2,542
Contig N40		2,706	1,872
Contig N50		2,280	1,221

Fig. 2: The closest organisms (n = 10) to R. chalepensis based on similar sequences identified in (a) SwissProt and (b) UniRef90 databases

De novo transcriptome assembly

A total of 154,018 genes (unigenes) were assembled with GC content of 42.60%. These unigenes were represented by 254,685 transcripts (isoforms). The average length of assembled contigs based on all transcript contigs was 1,183.54 bp, while the average length based on longest isoform per gene was 691.74 bp. Similarly, the median length of assembled contigs was 584 and 347 bp based on all contigs and longest isoform per gene, respectively. The contig N50 value based on all assembled genes was 2280 bp (Table 3). The results of assembly quality measures showed high quality of the resulting transcriptome (supplementary material S1).

Functional annotation

Identification of coding transcripts: Out of 254,685 de novo assembled transcripts, 202,961 (79.7%) had at least one BLAST result against either UniRef90 or SwissProt databases. The lengths of all searched sequences ranged from 184 to 15,955 bp with the majority ranging from 200 to 5,000 bp (Fig. 1a). Lengths of longest ORFs ranged from 17 to 5,126 bp with mean length of roughly 204 bp. Majority of longest ORFs (93.4%) had lengths between 50 and 1,000 bp (Fig. 1b). A total of 202,961 de novo assembled transcripts had at least one blast result after searching against either SwissProt or UniRef90 databases. Out of them, 139,400 transcripts had results (hits) in agreement with longest ORF. Aligning the assembled transcripts against SwissProt database led to annotation of 138,244 assembled transcripts, of them 67,634 and 70,610 were aligned on positive and negative strands, respectively. On the other hand, 198,704 assembled transcripts were aligned to different hits in UniRef90 database with 96,979 alignments on the positive strand and 101,725 alignments on the negative strand.

Identification of the closest organisms: Based on alignments obtained in SwissProt database, Arabidopsis thaliana, Oryza sativa subsp. japonica and Nicotiana tabacum were the closest organism to R. chalepensis with 69,175, 4,487 and 1,329 similar proteins, respectively (Fig. 2a). Similarly, different species belong to Citrus genus including orange (C. sinensis), clementine (C. clementina), and satsuma mandarin (C. unshiu) were the closest organisms to R. chalepensis with a total of 78,431 similar sequences (Fig. 2b) based on UniRef90 database annotations.

Protein conserved domains annotation: The obtained results showed the presence of 143,262 conserved domains. Occurrence percentage of the top 15 most occurred conserved domain is shown in Fig. 3. It was found that the most prominent conserved domains were protein kinase domain (pfam00069; 2.83%), protein tyrosine kinase (pfam07714; 1.72%), PPR repeat family (pfam13041; 0.85%), RNA recognition motif (pfam00076; 0.74%), and cytochrome P450 (pfam00067; 0.65%).

GO annotation: Annotated sequences in UniRef90 database revealed that the most enriched BP GO terms in R. chalepensis de novo assembled transcriptome were regulation of transcription (GO:0006355), translation (GO:0006412), and carbohydrate metabolic process (GO:0005975) (Fig. 4). Interestingly, flavonoid biosynthetic process (GO:0009813) and flavonoid metabolic process (GO:0009812) were enriched with 55 and 6 different genes, respectively. Moreover, L-phenylalanine biosynthetic process (GO:0009094), cinnamic acid biosynthetic process (GO:0009800), and flavonol biosynthetic process (GO:0051555) was enriched with 47, 12, and 3 different genes, respectively.

In terms of cellular component (CC), the most enriched terms included integral component of membrane (GO:0016021), nucleus (GO:0005634), and cytoplasm (GO:0005737). Furthermore, the top enriched MF GO terms based on the annotated proteins of R. chalepensis transcriptome were mainly related to binding functions and protein kinase activities (Fig. 4). In this category, phenylalanine ammonia-lyase (PAL) activity (GO:0045548) was enriched by 7 different genes. Four and 12 different genes were found in trans-cinnamate 4-monooxygenase activity (GO:0016710) and 4-coumarate-CoA ligase activity (GO:0016207), respectively. Naringenin-chalcone synthase activity (GO:0016210) and chalcone isomerase activity (GO:0045430) were enriched by 27 and 3 different genes, respectively. Flavonol synthase activity (GO:0045431), flavonol 3-O-glucosyltransferase activity (GO:0047893), and flavonol-3-O-glucoside glucosyltransferase activity (GO:0033838) were enriched by 8, 6 and 4 genes, respectively. Sixteen (16) genes were annotated in the term of flavonoid 3’-monooxygenase activity (GO:0016711). Quercetin 3’-O-glucosyltransferase activity (GO:0080043) were enriched by 46 different genes, respectively.

The most enriched BP GO terms via CD identified in R. chalepensis assembled transcripts were oxidation-reduction process (GO:0055114), protein phosphorylation (GO:0006468), regulation of transcription (GO:0006355), and transmembrane transport (GO:0055085) with percentages of occurrence exceeding 7% (Fig. 5). Interestingly, L-phenylalanine biosynthetic process (GO:0009094) and L-phenylalanine catabolic process (GO:0006559) were enriched by 33 and 20 different genes, respectively. Enriched CC GO terms included membrane (GO:0016020) with 4,743 genes, integral component of membrane (GO:0016021) with 3,713 genes, and ribosome (GO:0005840) with 2,139 genes (Fig. 5). ATP binding (GO:0005524), protein kinase activity (GO:0004672), and protein binding (GO:0005515) were the most enriched MF terms in the conserved domains of R. chalepensis.

Fig. 3: The most occurred conserved domains (n = 15) identified in the de novo assembled transcriptome of R. chalepensis

Fig. 4: The top 10 enriched biological process (BP), cellular component (CC), and molecular function (MF) GO terms by the de novo assembled transcripts of R. chalepensis

Pathway enrichment: Pathway enrichment was analyzed based on the annotation results against SwissProt database on two different levels of pathways (Fig. 6). The most enriched level 1 pathways were mainly related to fundamental processes including protein modification, amino-acid biosynthesis, and lipid metabolism. On that level, flavonoid metabolism pathway was enriched by 31 different de novo assembled genes. Similarly, top enriched level 2 pathways included protein ubiquitination, protein glycosylation, and glycolysis. Interestingly, analysis of level 2 pathway enrichment revealed that flavonoid biosynthesis pathway was enriched by 116 genes (Fig. 6).

Fig. 5: The top 10 enriched biological process (BP), cellular component (CC), and molecular function (MF) GO terms by the identified conserved domains in R. chalepensis

Fig. 6: The most enriched level 1 and 2 pathways by the de novo assembled transcripts of R. chalepensis

Microsatellite identification: A total of 45,367 of SSRs were identified in the de novo assembled transcriptome of R. chalepensis. These SSRs were identified in 27,443 sequences with 10,162 sequences have more than 1 SSR. If the same SSR were repeated within 100 bp of the sequences, it was considered as compound SSR. This rule identified 7,291 SSRs that present in compound formation. The mono-nucleotide SSRs were the most identified with 32769 (72.23%) occurrences, while the penta-nucleotide SSRs were least abundant with only 165 (0.36%) occurrences (Table 4).

Table 4: Number of identified SSRs in the de novo assembled transcriptome of R. chalepensis

SSR	Occurrences	Percentage
Mono-nucleotide	32,769	72.23%
Di-nucleotide	5,495	12.11%
Tri-nucleotide	6,113	13.47%
Tetra-nucleotide	537	1.18%
Penta-nucleotide	165	0.36%
Hexa-nucleotide	288	0.63%
Total	45,367	100.00%

Table 5: Number of transcripts/isoforms in the de novo assembled transcriptome of R. chalepensis that represent each enzyme involved in the biosynthesis of rutin

EC ID	Enzyme description	Abbreviation	Occurrences
4.3.1.24	Phenylalanine ammonia-lyase	PAL	13
1.14.14.91	trans-Cinnamate 4-monooxygenase	C4H	6
6.2.1.12	4-coumarate CoA ligase	4CL	22
2.3.1.74	Chalcone synthase	CHS	29
5.5.1.6	Chalcone isomerase	CHI	6
1.14.11.9	Flavonol synthase	FLS	45
1.14.20.6	Flavanone-3- dioxygenase	F3H	12
1.14.14.82	Flavonoid-3’- monooxygenase	F3'H	13
2.4.1.91	Flavonol 3-O-glucosyltransferase	-	8

Fig. 7: The most represented TFs families in the de novo assembled transcriptome of R. chalepensis

lncRNAs identification: lncRNAs were identified as the RNA molecule that was longer than 200 bp with coding potential score less than 0.5 and longest ORF shorter than 100 bp. Following these rules, 46,213 de novo assembled transcripts were identified as potential lncRNAs. Identified lncRNAs were generally shorter than protein coding transcripts with average transcript length of 813 and 2497 bp, respectively.

TFs identification: A total of 25,312 potential TFs were identified in the de novo assembled transcriptome of R. chalepensis. The most represented TF families were bHLH family with 2,719 genes followed by MYB_related with 1,722 genes, NAC family with 1,642 genes, ERF family with 1,447 genes, C2H2 family with 1292 genes, and C3H family with 1,011 genes. Furthermore, other 6 TFS families were represented by more than 750 different genes (Fig. 7).

Identification of enzymes involved in rutin biosynthesis: A total of 49,270 genes/transcripts were annotated to be associated with at least one enzyme and their respected EC ids were assigned. Enzymes known to be involved in the biosynthesis of rutin were examined. The number of transcripts/isoforms represent each enzyme involved in the biosynthesis of rutin was identified (Table 5). Surprisingly, flavonol-3-O-glucoside L-rhamnosyltransferase enzyme was not represented in any of the de novo assembled transcripts of R. chalepensis. However, 8 different transcripts were annotated as an isoform for a putative UDP-rhamnose rhamnosyltransferase identified in Fragaria ananassa and Pistacia vera.

Discussion

Application of RNA-Seq techniques to identify potential genes involved in biosynthesis of natural products of different plants has been recently increased and well-studied. Using NGS techniques, tons of transcriptomic data can be generated to examine the changes happen in transcription landscape of the plant during different stages or the variation in transcription status between different plant’s organs. However, one of the critical limitations facing RNA-Seq applications is the quality of the generated sequences (Del Fabbro et al. 2013). Therefore, several quality check and control tools has been developed including, FastQC, Cutadapt, and Trimmomatic, etc. In the current study, application of different quality control measures e.g. trimming of reads, removal of reads containing adapters or high N percentage, and removal of low-quality reads significantly enhanced the quality of the generated RNA-Seq data. However, in all these samples, fluctuations in per base sequence content were observed in the first 6 – 7 bases which is common in RNA sequencing because of primer amplification bias and seems not to affect any further analysis of data (Williams et al. 2016).

The de novo assembled transcriptome of R. chalepensis in the current study had a total length of 301 million base pairs (Mb). Based on flowcytometric analysis, R. chalepensis plants were reported to have 40 chromosomes (2n) and a total genome length of 686 Mb (Guerra 1984). In sweet orange (C. sinensis), which is the closest relative to R. chalepensis with whole genome sequence available, less than 50% of the whole genome is transcribed into different kinds of RNA (Xu et al. 2013). Furthermore, our analyses showed that average transcript length is 1,183 bp. This agrees with the average coding transcript length in sweet orange with 1,255 bp (Xu et al. 2013). Further quality check measures were applied to examine the quality of de novo assembled transcriptome of R. chalepensis. It was found that a total of 13,316 proteins in the SwissProt database had covered the de novo assembled transcripts of R. chalepensis by more than 80% of their respective lengths. This could be an indicator for the quality of the assembled transcriptome as most of the transcripts represent nearly full-length proteins. Afterwards, all the generated RNA-Seq reads were aligned back to de novo assembled transcriptome to examine their representation in the transcriptome. The alignment rate for the 9 replicates obtained from the different studied tissues (leaves, stems, and roots) ranged from 96.51 to 97.74%, indicating high representation of reads and high quality of the assembled transcriptome. BUSCO analysis is ideal for quantifications of completeness of transcriptome assembly (Waterhouse et al. 2012; Simão et al. 2015). Analyzing the de novo assembled transcriptome of R. chalepensis with BUSCO revealed that roughly 96% of BUSCOs in the Embryophyta (odb10) dataset were found in in their complete form while only 0.01% was missed. Moreover, analysis of transcriptome quality using strand specificity analysis showed that de novo assembled transcriptome of R. chalepensis is of high-quality and could be used for further analysis.

Functional annotation of the de novo assembled transcriptome of R. chalepensis led to annotation of 79.7% of the assembled transcripts. Based on these annotations, the closest organism to R. chalepensis i.e., the organism that shared the highest number of protein isoforms was A. thaliana based on SwissProt results. This result is reasonable as A. thaliana is the most represented plant in the SwissProt database (Bairoch and Apweiler 2000). On the other hand, based on the UniRef90 results, different species belonging to Citrus genus were identified as the closest organisms to R. chalepensis. According to the APG IV classification system (The Catalogue of Life Partnership 2017), R. chalepensis belongs to the Rutaceae family that includes Citrus genus. Furthermore, P. vera and Acer yangbiense of the family Sapindaceae were amongst the closest organisms to R. chalepensis. Rutaceae and Sapindaceae belong to the same order (Sapindales) according to the APG IV classification system (The Catalogue of Life Partnership 2017).

In the present study, 143,262 protein conserved domains were identified in the de novo assembled transcriptome of R. chalepensis. Protein conserved domains have different roles, especially in the vital processes. For example, protein kinase domain including tyrosine kinase domains have some catalytic functions (Gomperts et al. 2009). In plants, cytochrome P450 plays a critical role in the biosynthesis of secondary metabolites, especially flavonoids (Ayabe and Akashi 2006). Furthermore, it could play prominent roles in production of hormones and defensive compounds (Saxena et al. 2013). A significant number of secondary metabolites that cytochrome P450 involved in their biosynthesis characterized as inhibitors for different pathogens, herbivores, and insects (Kaspera and Croteau 2006).

The GO annotation of R. chalepensis de novo assembled transcriptome in the present study revealed that flavonoid biosynthetic process, flavonoid metabolic process, and flavonol biosynthetic process were enriched. Furthermore, L-phenylalanine and cinnamic acid biosynthetic processes were enriched in de novo assembled transcriptome. Phenylalanine is a precursor for different secondary metabolites including flavonoids (Comino et al. 2007). Furthermore, naringenin chalcone (the main pre-cursor of rutin) is mainly formed from 4-coumaroyl-CoA resulting mainly from cinnamic acid (Panwar et al. 2012). Our results showed that the activities of PAL, trans-cinnamate 4-monooxygenase, 4-coumarate-CoA ligase, naringenin-chalcone synthase, chalcone isomerase, flavonol synthase, and flavonol 3-O-glucosyltransferase were enriched in the de novo assembled transcriptome of R. chalepensis indicating the enrichment of rutin biosynthesis. Rutin in plants is biosynthesized mainly via flavonoid and flavonol biosynthesis pathways (Liu et al. 2019; Yi et al. 2019). Further analysis of pathway enrichment in the current study showed that flavonoid metabolism and flavonoid biosynthesis pathway were enriched by 31 and 116 different de novo assembled genes, respectively.

SSRs are very important tools in biotechnological studies as they provide several polymorphic DNA markers (Tautz 1989) that could be utilized in studies of genomic mapping (Hearne et al. 1992) and phylogenetic analysis (Queller et al. 1993). In the current study, we identified 45,367 SSRs in R. chalepensis with different lengths ranging from 1 to 6 nucleotides.

Analysis of enzymes in relation to the biosynthesis of rutin revealed that all enzymes known to be involved in this process were abundant in the de novo assembled transcriptome of R. chalepensis. Our results showed that 162 different transcripts in relation to rutin biosynthesis exist in R. chalepensis. In Asparagus officinalis, 57 genes were identified to be involved in rutin biosynthesis including C4H, CHS, and F3'H (Yi et al. 2019). Similarly, in tartary buckwheat, 47 genes were identified in relation to biosynthesis of rutin including PAL, CHS, CHI, FLS, and F3'H (Liu et al. 2018).

Conclusion

Whole transcriptome of R. chalepensis plant was sequenced, and de novo assembled using available bioinformatics methods. The results showed that R. chalepensis characterized by diverse and rich transcriptome landscape containing several protein coding and long non-coding RNAs. Several SSRs and TFs were identified in R. chalepensis. Moreover, analysis of assembled transcriptome reveals different genes involved in the biosynthesis of rutin in this plant. Further studies to examine the parts of R. chalepensis in which rutin is mainly biosynthesized and the relation with transcriptomic changes are recommended.

Acknowledgments

The authors are thankful to the Research Supporting Project number (RSP-2020/86), King Saud University, Riyadh, Saudi Arabia.

Author Contributions

EMA and MF planned the experiments, EMA and AAQ performed the lab work, EMA performed the computational analysis and illustrations, EMA, AAA and MF interpreted the results, EMA and MF made the write up, AAA and AAQ reviewed the manuscript.

References

Al-Majmaie S, L Nahar, GP Sharples, K Wadi, SD Sarker (2019). Isolation and antimicrobial activity of rutin and its derivatives from Ruta chalepensis (Rutaceae) growing in Iraq. Rec Nat Prod 13:64–70

Almutairi MM, WA Alanazi, MA Alshammari, MR Alotaibi, AR Alhoshani, SS Al-Rejaie, MM Hafez, OA Al-Shabanah (2017). Neuro-protective effect of rutin against Cisplatin-induced neurotoxic rat model. BMC Compl Altern Med 17:Article 472

Ayabe SI, T Akashi (2006). Cytochrome P450s in flavonoid metabolism. Phytochem Rev 5:271–282

Azevedo MI, AF Pereira, RB Nogueira, FE Rolim, GAC Brito, DVT Wong, RCP Lima-Júnior, R de Albuquerque Ribeiro, ML Vale (2013). The antioxidant effects of the flavonoids rutin and quercetin inhibit oxaliplatin-induced chronic painful peripheral neuropathy. Mol Pain 9; Article 53

Bairoch A, R Apweiler (2000). The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucl Acids Res 28:45–48

Beier S, T Thiel, T Münch, U Scholz, M Mascher (2017). MISA-web: A web server for microsatellite prediction. Bioinformatics 33:2583–2585

Bertrand YJK, A Petri, AC Scheen, M Topel, B Oxelman (2018). De novo transcriptome assembly, annotation, and identification of low-copy number genes in the flowering plant genus Silene (Caryophyllaceae). BioRxiv; Article 290510

Catalogue of Life Partnership (2017). https://www.gbif.org/dataset/7ddf754f-d193-4cc9-b351-99906754a03b (Accessed on: August 21 2020)

Cegan R, V Hudzieczek, R Hobza (2017). De novo transcriptome assembly of heavy metal tolerant Silene dioica. Genom Data 11:118–119

Comino C, S Lanteri, E Portis, A Acquadro, A Romani, A Hehn, R Larbat, F Bourgaud (2007). Isolation and functional characterization of a cDNA coding a hydroxycinnamoyltransferase involved in phenylpropanoid biosynthesis in Cynara cardunculus L. BMC Plant Biol 7; Article 14

Deihimi M, S Moghbelinejad, R Najafipour, K Parivar, M Nassiri-Asl (2017). The effects of rutin on the gene expression of Dazl, Bcl2, and Caspase3 in idarubicin-induced testicular damages in mice. Iran Red Crescent Med J 19; Article e44765

Del Fabbro C, S Scalabrin, M Morgante, FM Giorgi (2013). An extensive evaluation of read trimming effects on Illumina NGS data analysis. PLoS One 8; Article e85024

Duarte R, A Carvalho, D Gadelha, V Braga (2014). Rutin reduces oxidative stress in animals with renovascular hypertension. BMC Proc 8; Article P65

Emam A, M Eweis, M Elbadry (2010). A new furoquinoline alkaloid with antifungal activity from the leaves of Ruta chalepensis L. Drug Discov Ther 4:399–404

Ganeshpurkar A, AK Saluja (2017). The pharmacological potential of rutin. Saudi Pharm J 25:149–164

Gautam R, M Singh, S Gautam, JK Rawat, SA Saraf, G Kaithwas (2016). Rutin attenuates intestinal toxicity induced by Methotrexate linked with anti-oxidative and anti-inflammatory effects. BMC Compl Altern Med 16; Article 99

Gomperts BD, IM Kramer, PER Tatham (2009). Chapter 24 - Protein domains and signal transduction. In: Signal Transduction, 2^nd edn, pp:763–790. Gomperts BD, IM Kramer, PER Tatham (Eds.). Academic Press, San Diego, California, USA

Grabherr MG, BJ Haas, M Yassour, JZ Levin, DA Thompson, I Amit, X Adiconis, L Fan, R Raychowdhury, Q Zeng, Z Chen, E Mauceli, N Hacohen, A Gnirke, N Rhind, F di Palma, BW Birren, C Nusbaum, K Lindblad-Toh, N Friedman, A Regev (2011). Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol 29:644–652

Guerra MDS (1984). Cytogenetics of Rutaceae. II. Nuclear DNA content. Caryologia 37:219–226

Günaydin K, S Savci (2005). Phytochemical studies on Ruta chalepensis (LAM.) Lamarck. Nat Prod Res 19:203–210

Hafez MM, NO Al-Harbi, AR Al-Hoshani, KA Al-hosaini, SD Al Shrari, SS Al Rejaie, MM Sayed-Ahmed, OA Al-Shabanah (2015). Hepato-protective effect of rutin via IL-6/STAT3 pathway in CCl₄-induced hepatotoxicity in rats. Biol Res 48; Article 30

Hearne CM, S Ghosh, JA Todd (1992). Microsatellites for linkage analysis of genetic traits. Trends Genet 8:288–294

Huang X, M Xiao, J Xi, C He, J Zheng, H Chen, J Gao, S Zhang, W Wu, Y Liang, L Xie, K Yi (2019). De Novo transcriptome assembly of Agave H11648 by Illumina sequencing and identification of cellulose synthase genes in Agave species. Genes 10; Article 103

Huang X, J Yao, Y Zhao, D Xie, X Jiang, Z Xu (2016). Efficient rutin and quercetin biosynthesis through flavonoids-related gene expression in Fagopyrum tataricum Gaertn. hairy root cultures with UV-B irradiation. Front Plant Sci 7; Article 63

Jahan S, A Munawar, S Razak, S Anam, QU Ain, H Ullah, T Afsar, M Abulmeaty, A Almajwal (2018). Ameliorative effects of rutin against cisplatin-induced reproductive toxicity in male rats. BMC Urol 18; Article 107

Kang YJ, DC Yang, L Kong, M Hou, YQ Meng, L Wei, G Gao (2017). CPC2: A fast and accurate coding potential calculator based on sequence intrinsic features. Nucl Acids Res 45:W12-W16

Kaspera R, R Croteau (2006). Cytochrome P450 oxygenases of taxol biosynthesis. Phytochem Rev 5:433–444

Kerr SC, F Gaiti, M Tanurdzic (2019). De Novo plant transcriptome assembly and annotation using Illumina RNA-Seq Reads. In: Plant Long Non-Coding RNAs: Methods and Protocols, pp:265–275. Chekanova JA, H-LV Wang (eds.). Springer, New York, USA

Khan RA, MR Khan, S Sahreen (2012a). CCl₄-induced hepatotoxicity: Protective effect of rutin on p53, CYP2E1 and the antioxidative status in rat. BMC Compl Altern Med 12; Article 178

Khan RA, MR Khan, S Sahreen (2012b). Protective effects of rutin against potassium bromate induced nephrotoxicity in rats. BMC Compl Altern Med 12; Article 204

Liu M, Z Ma, T Zheng, W Sun, Y Zhang, W Jin, J Zhan, Y Cai, Y Tang, Q Wu, Z Tang, T Bu, C Li, H Chen (2018). Insights into the correlation between Physiological changes in and seed development of tartary buckwheat (Fagopyrum tataricum Gaertn.). BMC Genomics19; Article 648

Liu YY, XR Chen, JP Wang, WQ Cui, XX Xing, XY Chen, WY Ding, BO God’spower, N Eliphaz, MQ Sun, YH Li (2019). Transcriptomic analysis reveals flavonoid biosynthesis of Syringa oblata Lindl. in response to different light intensity. BMC Plant Biol 19; Article 487

Loizzo MR, T Falco, M Bonesi, V Sicari, R Tundis, M Bruno (2018). Ruta chalepensis L. (Rutaceae) leaf extract: Chemical composition, antioxidant and hypoglicaemic activities. Nat Prod Res 32:521–528

Martin M (2011). Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J 17:10–12

Moreton J, A Izquierdo, RD Emes (2016). Assembly, assessment, and availability of de novo generated eukaryotic transcriptomes. Front Genet 6; Article 361

Morgat A, E Coissac, E Coudert, KB Axelsen, G Keller, A Bairoch, A Bridge, L Bougueleret, I Xenarios, A Viari (2011). UniPathway: A resource for the exploration and annotation of metabolic pathways. Nucl Acids Res 40:D761–D769

Musacchia F, S Basu, G Petrosino, M Salvemini, R Sanges (2015). Annocript: A flexible pipeline for the annotation of transcriptomes able to identify putative long noncoding RNAs. Bioinformatics 31:2199–2201

Pan RY, J Ma, XX Kong, XF Wang, SS Li, XL Qi, YH Yan, J Cheng, Q Liu, W Jin, CH Tan, Z Yuan (2019). Sodium rutin ameliorates Alzheimer’s disease-like pathology by enhancing microglial amyloid-β clearance. Sci Adv 5; Article eaau6328

Panwar A, N Gupta, RS Chauhan (2012). Biosynthesis and accumulation of flavonoids in Fagopyrum spp. Eur J Plant Sci Biotechnol 6:17–26

Queller DC, JE Strassmann, CR Hughes (1993). Microsatellites and kinship. Trends Ecol Evol 8:285–288

Rai A, T Nakaya, Y Shimizu, M Rai, M Nakamura, H Suzuki, K Saito, M Yamazaki (2018). De novo transcriptome assembly and characterization of Lithospermum officinale to discover putative genes involved in specialized metabolites biosynthesis. Planta Med 84:920–934

Saxena A, P Singh, DK Yadav, P Sharma, S Alam, F Khan, ST Thul, RK Shukla, V Gupta, NS Sangwan (2013). Identification of cytochrome P450 heme motif in plants proteome. Plant Omics J 6:1–12

Seppey M, M Manni, EM Zdobnov (2019). BUSCO: Assessing genome assembly and annotation completeness. In: Gene Prediction: Methods and Protocols, pp:227–245. Kollmar M (Ed.). Springer, New York, USA

Simão FA, RM Waterhouse, P Ioannidis, EV Kriventseva, EM Zdobnov (2015). BUSCO: Assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31:3210–3212

Song L, L Florea (2015). Rcorrector: Efficient and accurate error correction for Illumina RNA-seq reads. GigaScience 4; Article 48

Tautz D (1989). Hypervariability of simple sequences as a general source for polymorphic DNA markers. Nucl Acids Res 17:6463–6471

The Catalogue of Life Partnership (2017). APG IV: Angiosperm Phylogeny Group classification for the orders and families of flowering plants. Available at GBIF.org

The Rnacentral Consortium (2018). RNAcentral: A hub of information for non-coding RNA sequences. Nucl Acids Res 47:D221–D229

Thiel T, W Michalek, R Varshney, A Graner (2003). Exploiting EST databases for the development and characterization of gene-derived SSR-markers in barley (Hordeum vulgare L.). Theor Appl Genet 106:411–422

Tian F, DC Yang, YQ Meng, J Jin, G Gao (2019). PlantRegMap: Charting functional regulatory maps in plants. Nucl Acids Res 48:D1104–D1113

Ungaro A, N Pech, JF Martin, RJS McCairns, JP Mévy, R Chappaz, A Gilles (2017). Challenges and advances for transcriptome assembly in non-model species. PLoS One 12; Article e0185020

Wang Z, M Gerstein, M Snyder (2009). RNA-Seq: A revolutionary tool for transcriptomics. Nat Rev Genet 10:57–63

Waterhouse RM, F Tegenfeldt, J Li, EM Zdobnov, EV Kriventseva (2012). OrthoDB: A hierarchical catalog of animal, fungal and bacterial orthologs. Nucl Acids Res 41:D358–D365

Wernersson R (2006). Virtual ribosome—a comprehensive DNA translation tool with support for integration of sequence feature annotation. Nucl Acids Res 34:W385–W388

Williams CR, A Baccarella, JZ Parrish, CC Kim (2016). Trimming of sequence reads alters RNA-Seq gene expression estimates. BMC Bioinform 17; Article 103

Xu Q, LL Chen, X Ruan, D Chen, A Zhu, C Chen, D Bertrand, WB Jiao, BH Hao, MP Lyon, J Chen, S Gao, F Xing, H Lan, JW Chang, X Ge, Y Lei, Q Hu, Y Miao, L Wang, S Xiao, MK Biswas, W Zeng, F Guo, H Cao, X Yang, XW Xu, YJ Cheng, J Xu, JH Liu, OJ Luo, Z Tang, W-W Guo, H Kuang, HY Zhang, ML Roose, N Nagarajan, XX Deng, Y Ruan (2013). The draft genome of sweet orange (Citrus sinensis). Nat Genet 45:59–66

Yi TG, YR Yeoung, IY Choi, NI Park (2019). Transcriptome analysis of Asparagus officinalis reveals genes involved in the biosynthesis of rutin and protodioscin. PLoS One 14; Article e0219973

Yoshida T, Y Watanabe, S Ishiura (2019). Production of the herb Ruta chalepensis L. expressing amyloid β-GFP fusion protein. Proc Jpn Acad Ser B Phys Biol Sci 95:295–302